Applying Statistical Methods to Small Corpora: Benefiting from a Limited Domain

Authors

  • David Fisher
  • Ellen Riloff
Abstract

The application of statistical approaches to problems in natural language processing generally requires large (1,000,000+ words) corpora to produce useful results. In this paper we show that a well-known statistical technique, the t test, can be applied to smaller corpora than was previously thought possible, by relying on semantic features rather than lexical items in a corpus of limited domain. We apply the t test to the problem of resolving relative pronoun antecedents, using collocation frequency data collected from the 500,000 word MUC-4 corpus. We conduct two experiments where t is calculated with lexical items and with semantic feature representations. We show that the test cases that are relevant to the MUC-4 domain produce more significant values of t than the ones that are irrelevant. We also show that the t test correctly resolves the relative pronoun in 91.07% of the relevant test cases where the value of t is significant.
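
As a rough illustration of the kind of calculation the abstract refers to, the sketch below computes a t score for a collocation from frequency counts. The abstract does not state the exact formula the authors use, so this follows one common formulation: the observed probability of a pair is compared against the probability expected if the two items occurred independently. The function name `collocation_t_score`, the counts, and the 1.645 critical value (one-tailed, 5% level, large sample) are assumptions made for illustration only.

```python
import math

# Assumed significance threshold: one-tailed 5% critical value for large samples.
CRITICAL_T = 1.645


def collocation_t_score(pair_count, count_a, count_b, total):
    """Return a t score for how far a pair's frequency departs from chance.

    pair_count: times items A and B were observed together
    count_a, count_b: individual frequencies of A and B
    total: number of observations (e.g. tokens or candidate pairs) in the corpus
    """
    if total <= 0 or not (0 < pair_count < total):
        raise ValueError("t score is undefined for these counts")
    observed = pair_count / total                     # sample mean
    expected = (count_a / total) * (count_b / total)  # mean under independence
    variance = observed * (1.0 - observed)            # Bernoulli sample variance
    return (observed - expected) / math.sqrt(variance / total)


# Made-up counts, not taken from the MUC-4 corpus.
t_related = collocation_t_score(pair_count=20, count_a=100, count_b=100, total=10_000)
t_chance = collocation_t_score(pair_count=1, count_a=100, count_b=100, total=10_000)

print(f"related pair: t = {t_related:.2f}, significant = {t_related > CRITICAL_T}")
print(f"chance pair:  t = {t_chance:.2f}, significant = {t_chance > CRITICAL_T}")
```

Under this formulation, replacing lexical items with semantic features changes only what is counted (feature co-occurrences instead of word co-occurrences); the arithmetic stays the same, and pooling words into features raises the counts available from a small corpus.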

Similar resources

Sea Surfaces Scattering by Multi-Order Small-Slope Approximation: a Monte-Carlo and Analytical Comparison

L-band electromagnetic scattering from two-dimensional random rough sea surfaces is calculated by first- and second-order Small-Slope Approximation (SSA1, 2) methods. Both analytical and numerical computations are utilized to calculate incoherent normalized radar cross-section (NRCS) in mono- and bi-static cases. For evaluating inverse Fourier transform, inverse fast Fourier transform (IFFT) i...

A Neural-Network-Based System for Recognizing and Classifying Named Entities in Persian Texts

Named Entity Recognition (NER) is a fundamental task in natural language processing and is also known as a subset of information extraction. We seek to locate and classify named entities in text into predefined categories such as the names of persons, organizations, locations, expressions of times, etc. Named Entity Recognition for English texts has been researched widely in past years, howev...

The Estimation of Powerful Language Models from Small and Large Corpora

This paper deals with the estimation of powerful statistical language models using a technique that scales from very small to very large amounts of domain-dependent data. We begin with an improved modeling of the grammar statistics, based on a combination of the backing-off technique [6] and zero-frequency techniques [2, 9]. These are extended to be more amenable to our particular system. Our r...

Professional Competence, Preschool Teachers, Preschool Period

The main aim of this research is to examine the existing status (extent of benefiting) and the desirable status (extent of importance) of the professional qualifications of preschool teachers, as viewed by teachers and managers of the preschool period; it was carried out using a descriptive survey method. By reviewing the literature and research background, 12 domains (main components) and 98 subordinat...

Subdomain Sensitive Statistical Parsing using Raw Corpora

Modern statistical parsers are trained on large annotated corpora (treebanks). These treebanks usually consist of sentences addressing different subdomains (e.g. sports, politics, music), which implies that the statistics gathered by current statistical parsers are mixtures of subdomains of language use. In this paper we present a method that exploits raw subdomain corpora gathered from the web...

Publication date: 2001